Applying Multidimensional Scaling (MDS) to Trader Joe’s Cheese

Multidimensional scaling (MDS) is a method for building a low-dimensional model of a set of objects that preserves as much of the inter-object distance structure as possible. Its input is the set of pairwise distances between the objects. From these, MDS produces a set of points in a low-dimensional Euclidean space, one point per object, such that the pairwise distances between the points are as close as possible to the corresponding pairwise distances between the objects.

The set of objects does not need to have a known high-dimensional model. Moreover, how the distance between objects is defined is up to us; we can choose among many distance metrics. One simple example is Euclidean distance, which is what I use to create my models in the following sections. We then create a symmetric distance matrix, \(D\), in which \(D_{ij}\) is the distance between objects \(i\) and \(j\), with \(D_{ij} = D_{ji}\) and \(D_{ii} = 0\).
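As a minimal sketch of how such a matrix could be built, the snippet below computes pairwise Euclidean distances in plain Python. The function name `distance_matrix` and the toy vectors are illustrative assumptions, not the code or data used for the models in this post.

```python
import math

def distance_matrix(rows):
    """Return the symmetric matrix D where D[i][j] is the Euclidean
    distance between feature vectors rows[i] and rows[j]."""
    n = len(rows)
    D = [[0.0] * n for _ in range(n)]  # the diagonal stays 0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(rows[i], rows[j])  # Euclidean distance (Python 3.8+)
            D[i][j] = d
            D[j][i] = d  # enforce symmetry: D[i][j] == D[j][i]
    return D

# Toy feature vectors (illustrative values, not from the cheese dataset).
toy = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(toy)
```

Only the upper triangle is computed and then mirrored, which makes the two properties \(D_{ij} = D_{ji}\) and \(D_{ii} = 0\) hold by construction.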

I applied MDS to a dataset I created by web scraping nutritional and other information on cheeses from the Trader Joe’s website: https://www.traderjoes.com/home/products/category/cheese-29

Preparing the Data to Model

Cleaning the data was tedious and was done in Python. The process is well commented in the Python notebook in the repository here: https://github.com/lindseygao/mds-tj-cheese. The repository also includes the clean dataset, named clean_df.

After cleaning the data, which involved dropping unnecessary columns and filling in missing data, I ended up with a dataset of 31 rows and the following 10 features:

## Rows: 31
## Columns: 10
## $ price              <dbl> 5.980000, 4.920000, 7.990000, 6.653333, 11.440000, ~
## $ serving.size       <dbl> 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,~
## $ calories           <dbl> 70.00000, 100.00000, 100.00000, 110.00000, 90.00000~
## $ total.fat          <dbl> 6.000000, 8.000000, 8.000000, 8.000000, 7.000000, 6~
## $ saturated.fat      <dbl> 3.5, 5.0, 6.0, 6.0, 4.5, 5.0, 8.0, 3.5, 7.0, 3.5, 3~
## $ cholesterol        <dbl> 20.00000, 25.00000, 25.00000, 25.00000, 25.00000, 2~
## $ sodium             <dbl> 80.0000, 150.0000, 320.0000, 230.0000, 200.0000, 19~
## $ total.carbohydrate <dbl> 0.0, 3.0, 0.0, 0.0, 2.0, 0.0, 0.0, 3.0, 3.0, 0.0, 5~
## $ protein            <dbl> 5, 6, 6, 7, 6, 6, 6, 4, 0, 3, 4, 7, 6, 0, 7, 6, 0, ~
## $ calcium            <dbl> 80, 190, 150, 250, 200, 150, 150, 40, 0, 40, 0, 150~
##   price serving.size calories total.fat saturated.fat cholesterol sodium
## 1  5.98           28       70         6           3.5          20     80
## 2  4.92           28      100         8           5.0          25    150
## 3  7.99           28      100         8           6.0          25    320
##   total.carbohydrate protein calcium
## 1                  0       5      80
## 2                  3       6     190
## 3                  0       6     150

The units of the columns are as follows:

Note that the nutrition information has been standardized to one serving of 28 grams.

Since the units and scales of the columns differ, we need to normalize the columns of the input data so that they are all “comparable.” This ensures that no single dimension dominates the distance calculations, and hence our MDS model, simply because of its scale.

I normalized the data using min-max normalization so that the results are not skewed by the units of each feature. Min-max normalization finds the minimum and maximum value in each column, and then maps each column entry \(x\) to \((x - \min)/(\max - \min)\). The result is a new dataset in which each column has a minimum of 0 and a maximum of 1 (provided the column is not constant, since otherwise \(\max - \min = 0\)).
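The mapping can be sketched in a few lines of Python. This is an illustration rather than the code behind the models here; the name `min_max_normalize` and the choice to send a constant column to 0 are assumptions made for the example. The sample rows are the first three rows printed above, restricted to price, serving.size and calories, so the minimum and maximum are taken over those three rows only.

```python
def min_max_normalize(rows):
    """Scale every column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # Assumption: a constant column (max == min) is mapped to 0.0
            # so that we never divide by zero.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled

# Columns: price, serving.size, calories (first three rows shown above).
sample = [[5.98, 28, 70], [4.92, 28, 100], [7.99, 28, 100]]
normalized = min_max_normalize(sample)
```

In this sample the serving.size column is constant at 28, so it contributes nothing to any distance after normalization, whichever convention is used for constant columns.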

Initial Plots

Below is the eigenvalue plot of the model:

We see that the first 4 eigenvalues are pretty large.
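For context on where these eigenvalues come from: assuming the model is classical (Torgerson) MDS, which is the variant that produces an eigenvalue spectrum, they are the eigenvalues of the double-centered matrix of squared distances. The symbols \(n\) (number of objects), \(J\), \(D^{(2)}\), \(B\), \(V\), \(\Lambda\) and \(X_k\) are notation introduced for this sketch:

\[
J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}, \qquad
B = -\tfrac{1}{2}\, J D^{(2)} J, \qquad
B = V \Lambda V^{\top}, \qquad
X_k = V_k \Lambda_k^{1/2}
\]

Here \(D^{(2)}\) is the matrix with entries \(D_{ij}^2\), \(\Lambda = \operatorname{diag}(\lambda_1 \ge \lambda_2 \ge \dots)\) holds the eigenvalues of \(B\), and \(V_k\), \(\Lambda_k\) keep only the \(k\) largest eigenvalues and their eigenvectors. The rows of \(X_k\) are then the coordinates of the objects in the \(k\)-dimensional model.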

The first eigenvalue captures 0.446 of the total energy in the data, where energy is measured as a share of the sum of all eigenvalues. The first 2 eigenvalues capture 0.679 of the total energy and the first 3 eigenvalues capture 0.772.
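The energy figures are cumulative shares of the eigenvalue sum. A minimal sketch of that calculation follows; the eigenvalues in it are hypothetical, since the actual values are not listed in this post, and it assumes all eigenvalues are non-negative (which holds when the distances are Euclidean).

```python
from itertools import accumulate

def cumulative_energy(eigenvalues):
    """Return, for k = 1, 2, ..., the share of the eigenvalue sum
    captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]

# Hypothetical eigenvalues, for illustration only.
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]
shares = cumulative_energy(example_eigenvalues)
# shares[k - 1] is the energy captured by a k-dimensional model.
```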

One-Dimensional Model

Below is the plot of the one-dimensional model of the data:

We can also plot how the distances produced in the one-dimensional model differ from the original distances:

The line \(y = x\) is also plotted; points on this line would indicate a perfect fit. The points are somewhat close to the line, but noticeable deviations are visible.

We can evaluate the model in terms of 3 additional metrics: (1) the goodness-of-fit (GOF), (2) the mean absolute difference and (3) the mean squared difference between the model’s distances and the true distances. For the one-dimensional model these are 0.4465, 0.4226 and 0.2624 respectively.
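The two difference metrics can be sketched as follows. The function name `distance_errors` and the toy matrices are illustrative, and the sketch assumes the averages are taken over each unordered pair \(i < j\) once; the post does not state which convention was used, so the exact values may differ under another one.

```python
def distance_errors(true_D, model_D):
    """Return (mean absolute difference, mean squared difference) between
    two distance matrices, averaged over all pairs i < j."""
    n = len(true_D)
    diffs = [
        model_D[i][j] - true_D[i][j]
        for i in range(n)
        for j in range(i + 1, n)  # assumption: each pair counted once
    ]
    mean_abs = sum(abs(d) for d in diffs) / len(diffs)
    mean_sq = sum(d * d for d in diffs) / len(diffs)
    return mean_abs, mean_sq

# Toy distance matrices (illustrative values, not from the cheese dataset).
true_D = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
model_D = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]
mean_abs, mean_sq = distance_errors(true_D, model_D)
```

Both metrics are 0 for a perfect fit, which corresponds to every point lying on the line \(y = x\) in the distance plot.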

Below is an interactive plot of the one-dimensional model of the data, which makes it easy to see the labels of the various cheeses. Note that the color scale is the price of the cheese in dollars/lb.

It’s interesting to note that the vegan cheese options are clustered together and the goat cheese and feta cheese options are also somewhat grouped together. However, most of the cheeses sit on the left side, and many of them are cheddar cheeses.

Two-Dimensional Model

Below is a plot of the two-dimensional model:

Below is the distance plot of the two-dimensional model:

We observe that the distance plot of the two-dimensional model fits the line \(y = x\) much better than that of the one-dimensional model.

The additional evaluation metrics for the two-dimensional model are a GOF of 0.6791, a mean absolute difference of 0.2461 and a mean squared difference of 0.0937.

Below is an interactive plot of the two-dimensional model of the data. Again, the color scale is the price of the cheese in dollars/lb.

In the two-dimensional model, we once again see the vegan cheeses closely clustered together and far from the other cheeses, which is to be expected given their nutritional composition. We also see ricotta, the only ricotta cheese in this dataset, all the way in the far upper right corner. Again, the many variations of cheddar cheese are clustered together towards the left. A fun note is that the unusual cheeses “garlic bread cheese” and “pizza bread cheese” are also in the big cheddar cheese cluster, right next to each other.

Three-Dimensional Model

Below is the distance plot comparing the distances of the three-dimensional model and the true normalized distances:

We observe that the distance plot of the three-dimensional model fits the line \(y = x\) slightly better than that of the two-dimensional model, but the improvement is smaller than the one gained by going from one to two dimensions. The additional evaluation metrics for the three-dimensional model are a GOF of 0.7717, a mean absolute difference of 0.1671 and a mean squared difference of 0.0429.

Summary of Models

Below is a summary table of the evaluation metrics for the three models:

##                   1D Model 2D Model 3D Model
## GOF               0.4465   0.6791   0.7717  
## Mean Abs Diff     0.4226   0.2461   0.1671  
## Mean Squared Diff 0.2624   0.0937   0.0429

We see a large improvement when we increase the dimension from one to two (GOF rises from 0.4465 to 0.6791 and the mean squared difference falls from 0.2624 to 0.0937) and a smaller improvement when we increase it from two to three (GOF 0.7717, mean squared difference 0.0429). This aligns with what we saw in the distance plots.